Data Wrangling: The process of getting the data ready for analysis, including: accessing data, reformatting in appropriate ways (format, orientation), creating variables, recoding values, and selecting variables and/or observations of interest for analysis.
How you process and recode data is critical! Principles that should guide us include:
Replication: Can others do what you did? Art depends on personality, science depends on replication.
Understanding: Can others understand what you did and why? Can you follow what you did if you come back to the code in a year?
We want to do things that are understandable and sensible. Always imagine that you are handing the code off to someone else – can they understand what you are doing? Can they follow what you have done to the data? The goal in data wrangling is to provide a set of steps that will take you from the raw data to the data being analyzed in clear and understandable steps so that others can see exactly the choices that you have made (and make other choices if desired to see the robustness of your inferences).
Data wrangling in R gives us the ability to do this because we can document and explain the exact steps we are taking to organize the data. Using code to make changes means anyone and everyone can observe what we did.
In contrast, if you make changes to your data in a spreadsheet like Excel or Google Sheets or Numbers it can be impossible for someone else to figure out what you did. What variables did you rename, what transformations did you do, how did you recode values?
People using Excel for analysis have been identified as being (partially) responsible for what is widely believed to be a failed global economic policy.
And using Excel for data entry has resulted in well-documented errors in Medical Research due to “auto-correct” changing value names.
In fact, Dr. Jennifer Byrne – named one of the top 10 people in science in 2017 because of her data sleuthing – has developed an entire program to find and correct errors produced by data entry in Excel! Link here.
So the goal of this topic is to give you some basic tools to be able to manipulate data and get it ready for analysis. This includes inspecting your data and getting a sense of the type of data you have.
When you start out (in this class) it is always a good idea to have a clean – i.e., empty – workspace so that your analysis is based on the code you are running rather than on manipulations that were done previously. Do not be one of those students at the end of the class who has every dataset they worked with in class saved and loaded in their Environment.
If you are starting out with an empty workspace, RStudio should say “Environment is empty” in the Environment tab. It is a good idea to always start with an empty environment because the data manipulations we are doing will not be cumulative.
To that end, when you quit RStudio, DO NOT save the workspace image. This will ensure that you are always starting with a clean and empty workspace, which means that there is never a possibility of using “old” data in your analyses. If you are working on a larger project across several sessions it sometimes makes sense to save it, but for this class you should get in the practice of writing self-contained code that reads and wrangles the data you need to analyze every time rather than relying on earlier work.
When starting out it can be easy to forget that R needs to be told where to look to find things!
For every assignment/lecture create a new folder on your computer (e.g., “Topic 3”).
Download the code and data into this folder (or move them from the Downloads folder to that folder), or create a new RMarkdown file and save the code to this folder.
In the Console window of RStudio, figure out where R thinks it is by looking at the title bar or using getwd (get working directory)!
getwd()
## [1] "/Users/rweldziu/Library/CloudStorage/Dropbox/Villanova/TEACHING/PSC7000 - Research Concepts and Approaches/PSC7000_F2024/Homeworks"
If this is different than where your code and data is there will be trouble!
Because R is object-oriented, we can actually save this location as an object and use it!
mydir <- getwd()
We can then change the working directory using setwd (set working directory) and the location we just saved to the object mydir!
setwd(mydir)
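The round trip described above can be sketched in a self-contained way; here tempdir() is just a stand-in destination chosen for illustration, not something from the notes:

```r
# A minimal sketch: remember the current location, move away, then return.
mydir <- getwd()     # save where R currently is as an object
setwd(tempdir())     # change the working directory (to R's temporary folder)
setwd(mydir)         # use the saved object to return to the original folder
getwd()              # confirm we are back where we started
```

Because getwd() returns a plain character string, saving it as an object is no different from saving any other value.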
We can also use RStudio to accomplish the same task.
Open your code in R-Studio.
The GUI will automatically change the directory, but you should also copy and paste the resulting code snippet from the Console into your code!
Warning!
What R “sees” when working in the Console window (or using an R
Script) is different from what it “sees” using RMarkdown! You can
Knit a RMarkdown document to conduct analysis without any
object ever appearing in the Global Environment.
Quick Exercise Clear out the Global Environment and
Knit this document. Confirm that the code will run and
nothing will be created in the Global Environment.
While this makes sense once the code/project is completed, it can make coding more difficult because we often want to be able to manipulate the code and objects while working. To do so we are going to write and evaluate chunks one-at-a-time to ensure the objects are created in the Global Environment.
So getting started we always start our code by:
Loading the tidyverse library and any other libraries we are going to use.
library(tidyverse)
Using getwd and setwd to tell R where in your computer it should be looking for additional data (and code).
getwd()
setwd()
Import Dataset. If you are loading in data that is already
formatted in R then you go to the RStudio menu:
File > Open File.... (Note that this menu also has the
File > Import Dataset option.)
Because we want to document what dataset we are using (for replicability), we want to copy and paste the resulting code into our RMarkdown file so that next time we can run the code directly!
A second way to access data is to use some functions directly in R.
This is what the GUI is actually doing behind the scenes. We are going
to focus on three functions in this class: load, read_rds, and url.
A tidyverse function we will sometimes use is
read_rds (Read R Data Structure). This will load a single
object into memory. Here we are reading in a subset of the 2020 Michigan
Exit Poll and naming it the tibble mi_ep.
mi_ep <- read_rds("../data/MI2020_ExitPoll_small.rds")
The first part of the code is defining a new object called
mi_ep (short for Michigan exit poll) and then assigning
(<-) that object to be defined by the dataset being read
in by the read_rds function.
To highlight the syntax, consider what happens when you run the following code?
read_rds("../data/MI2020_ExitPoll_small.rds")
If we read in data without assigning it to an object R will simply
print the results to the console window. Nothing will be saved and we
cannot use the data in future analyses! When reading in an R data object
using read_rds we always need to define a new object
associated with that file!
Before moving on, let’s clean up our Environment – always a good idea
between lectures and assignments! – using either the broom icon in the
Environment GUI to delete every object or the rm (remove)
command to remove specific objects. Since we only have one object in our
Environment so far we can use rm to remove it.
rm(mi_ep)
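The cleanup workflow can be sketched in a self-contained way; the objects a and b below are made up purely for illustration:

```r
# Create two throwaway objects, remove one, and list what remains.
a <- 1
b <- 2
rm(a)        # remove a single object by name
objects()    # "a" is gone; "b" remains (plus anything else already defined)
# rm(list = objects()) would empty the environment entirely,
# which is what the broom icon in the Environment tab does.
```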
A nice thing about R is that it can have multiple objects in the
Global Environment at the same time. This allows you to work with
multiple datasets and objects. Instead of loading them all into memory
one at a time using read_rds, we can also use the
load function to load several R objects that have been
saved into a single file (using the save function) all at
once.
In the file named MI2020_ExitPoll.Rdata I have
previously saved the entire 2020 Michigan Exit Poll (the object named
MI_final) as well as a subset of the exit poll that focuses
on variables we are going to analyze in class (named
MI_final_small). To give you a chance to do your own
analyses, I have saved both files together so you can access either
one.
Note that when we are loading an .Rdata
file we do not need to assign it to an object because
the objects are already defined within the .Rdata object.
You can see this when you use the objects function to print
to the console window the objects that are loaded into memory (which
will also match the Global Environment list when we run the code
chunk-by-chunk):
load(file = "../data/MI2020_ExitPoll.Rdata")
objects()
## [1] "MI_final" "MI_final_small" "mydir"
Relatedly, because we are loading several objects at once the name of the file will not be the same as the name of the objects being loaded.
NOTE: R will look for this file relative to where it is (as is given
by getwd). Here I have created a data folder
located within the current directory and I am telling R to look for my data
in that folder!
We can also read in non-R files. For example, we can read in
comma-delimited files (read.csv), tab-delimited files
(read.delim), Excel-created files (read.xls,
read.xlsx), and files from other statistical languages
(e.g., read.dta, read.sav). Reading in non-R
objects/data requires reading in the data and also assigning the data to
a new R object.
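A self-contained sketch of reading a non-R file: since we need a CSV to read, the example first writes a small invented one to a temporary file (the survey data here is made up for illustration):

```r
# Write a tiny made-up CSV to a temporary file, then read it back in.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, answer = c("Yes", "No", "Yes")),
          tmp, row.names = FALSE)

# Reading non-R data always requires assigning the result to a new object.
survey <- read.csv(tmp)
dim(survey)   # 3 rows, 2 columns
```

The same assignment pattern applies to read.delim, read.xlsx, read.dta, and the other import functions mentioned above.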
Datasets in the tidyverse are called
tibbles. Why? As with most things, it seems Twitter is to
blame… Tibbles can be thought of as a matrix with rows and
columns, and we can extract the dimensions of the data using the
dim (dimensions) command, where the output is a vector of
length two consisting of the number of rows – here 1231 – and then the
number of columns – 14.
dim(MI_final_small)
## [1] 1231 14
In most of the work that we will do in DSCI 1000 rows typically refer to observations – e.g., survey respondents – and columns refer to characteristics of those observations (e.g., survey responses) which we call variables.
We can use matrix indexing to examine/extract specific rows and columns of tibbles/dataframes. To select the first three rows we would use:
MI_final_small[1:3,]
## # A tibble: 3 × 14
## SEX AGE10 PRSMI20 PARTYID WEIGHT QRACEAI EDUC18 LGBT BRNAGAIN LATINOS
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 2 1 3 0.405 1 4 NA NA 2
## 2 2 10 1 1 1.81 2 1 2 1 2
## 3 2 7 1 1 0.860 1 5 2 2 2
## # ℹ 4 more variables: RACISM20 <dbl>, QLT20 <fct>, preschoice <chr>,
## # Quality <chr>
To select the first three columns…
MI_final_small[,1:3]
## # A tibble: 1,231 × 3
## SEX AGE10 PRSMI20
## <dbl> <dbl> <dbl>
## 1 2 2 1
## 2 2 10 1
## 3 2 7 1
## 4 1 9 1
## 5 2 8 1
## 6 2 7 1
## 7 1 9 1
## 8 1 8 1
## 9 2 6 2
## 10 1 8 1
## # ℹ 1,221 more rows
Selecting the first two rows and first four columns
MI_final_small[1:2,1:4]
## # A tibble: 2 × 4
## SEX AGE10 PRSMI20 PARTYID
## <dbl> <dbl> <dbl> <dbl>
## 1 2 2 1 3
## 2 2 10 1 1
Selecting a collection/combination of rows and all columns we would use:
MI_final_small[c(1:2,45,100),]
## # A tibble: 4 × 14
## SEX AGE10 PRSMI20 PARTYID WEIGHT QRACEAI EDUC18 LGBT BRNAGAIN LATINOS
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 2 1 3 0.405 1 4 NA NA 2
## 2 2 10 1 1 1.81 2 1 2 1 2
## 3 2 7 1 3 0.488 1 3 NA NA 2
## 4 2 10 2 4 0.299 1 2 2 1 2
## # ℹ 4 more variables: RACISM20 <dbl>, QLT20 <fct>, preschoice <chr>,
## # Quality <chr>
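The same bracket indexing works on any data frame or tibble; here is a toy example (the data below is made up and is not the exit poll):

```r
# A small made-up data frame with exit-poll-style column names.
df <- data.frame(SEX = c(2, 2, 1, 1), AGE10 = c(2, 10, 9, 8))

df[1:2, ]          # first two rows, all columns
df[, 1]            # all rows, first column (returned as a vector)
df[c(1, 3), 1:2]   # rows 1 and 3, columns 1 and 2
```

The general pattern is data[rows, columns]; leaving either side of the comma blank selects everything on that dimension.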
Introduce some basic tools and techniques to answer a series of questions relevant for politics, political science, society, etc.
Think about the data we have available and its limitations!
Enable you to be the next political-media star.
To do our data wrangling we are going to wrangle the 2020 National Exit Poll from the National Election Pool in the state of Michigan.
We are going to use the actual data we got on Election Night!
To answer some of the same questions we were answering.
But not much has been cleaned up since then so lots of work to do! (Ugh…)
Exit polls are polls that are done on (or immediately before) elections in the United States. They are used to help interpret the meaning of an election – i.e., why voters voted the way they did, whether some types of voters were more likely to support one candidate over the other, etc. – but intelligent prognosticators do not use them to actually project who is going to win. Put differently, they are used to help interpret an election, but they are not used to predict it (which is what voting data is used for).
In class we are going to focus on an Exit Poll from the National Election Pool for the state of Michigan. Michigan is increasingly thought of as a “swing state” that could be won by either a Democrat or a Republican in a presidential contest following its surprising support for President Trump in the 2016 presidential election. Michigan is also a rather diverse state in terms of its population and interests and some have worked to identify groups of like-minded voters within the state. (Note that we will also do this when we get to the “clustering” topic!)
The Exit Poll data we will analyze is based on the following questionnaire that reports the precise questions that were asked, as well as the value labels associated with each response. So while the data that we will read in will have responses that are coded as a “1” or “3” or “4”, interpreting the meaning of those values requires comparing the values to the questions below.
Predictive: Use the data to predict an outcome of interest.
Descriptive: Use the data to describe an event.
THINKING: Is any of it causal? Can we determine what causes a voter to support a candidate from this data?
Whenever we do anything using data we should first inspect our data
to make sure it makes sense. The function glimpse gives us
a quick summary by printing the first few observations of every variable
to the screen. Note that the presentation of the data is flipped
(technically transposed) so that the columns are presented as rows.
glimpse(MI_final_small)
## Rows: 1,231
## Columns: 14
## $ SEX <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2,…
## $ AGE10 <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 6, 8, 9, 10, 1, 5, 9, 10, 8, 4, 1,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1,…
## $ WEIGHT <dbl> 0.4045421, 1.8052619, 0.8601966, 0.1991648, 0.1772090, 0.49…
## $ QRACEAI <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 9, 1, 1, 1, 1, 1, 3, 1, 1, 1,…
## $ EDUC18 <dbl> 4, 1, 5, 4, 5, 3, 3, 3, 4, 4, 5, 5, 4, 1, 1, 1, 5, 2, 4, 2,…
## $ LGBT <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, NA, NA, 2, NA, 2, 2,…
## $ BRNAGAIN <dbl> NA, 1, 2, NA, NA, 2, 1, 2, NA, NA, NA, NA, NA, 2, NA, 2, 1,…
## $ LATINOS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,…
## $ RACISM20 <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, NA, NA, 2, NA, 9, 4,…
## $ QLT20 <fct> Has good judgment, NA, NA, Has good judgment, Cares about p…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality <chr> "Has good judgment", NA, NA, "Has good judgment", "Cares ab…
This is a useful representation as we can immediately see what
variables we have as well as what some of the values are. Here, for
example, we see that we have missing data – denoted as NA
in R - in quite a few variables. We also see that some variables have
numbers for values while others have strings of letters (e.g.,
QLT20, preschoice).
This variation highlights the fact that there are several types of data:
<dbl> Double. “Numbers as a number.” Numbers
stored to a high level of scientific precision. Mathematical operations
are defined. (At least in theory!) e.g., SEX
<int> Integer. “Numbers as a number.”
Mathematical operations are defined. (At least in theory!) R treats
<dbl> and <int> as largely
interchangeable.
<chr> Character. A variable with letter and/or
number values. Mathematical operations are not defined, but
other functions exist (e.g., extract the first and last characters,
etc.) e.g., preschoice
<fct> Factor. A variable defining group
membership. Mathematical operations are not defined, but they
can be used in special ways in R. e.g. QLT20. Note how the
values of a character variable like preschoice are in
quotes while the values of a factor variable like QLT20 are
not. Factor variable can “look” like numeric or character variables in
that the values they take on may be numbers or letters (or both), but R
interprets them differently than numeric and character variables and
they can be used to do special things.
NOTE: There are also list objects, but we will cover
them when needed.
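A quick way to check these types yourself is with the class function; the toy vectors below are illustrations, not variables from the exit poll:

```r
# One small vector of each type discussed above.
x_dbl <- c(1.5, 2.0)                 # double: math is defined
x_int <- c(1L, 2L)                   # integer: the L suffix makes it an integer
x_chr <- c("Joe Biden", "Refused")   # character: no math, but string operations
x_fct <- factor(c("Dem", "Rep"))     # factor: group membership

class(x_dbl); class(x_int); class(x_chr); class(x_fct)
```

Note that class reports doubles as "numeric", which reflects how interchangeably R treats <dbl> and <int>.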
A second way to get a sense of the data is to use the
summary command which will report the quantiles for any
integer or numeric variable, the count of every value for a factor
variable, and a note if the variable is a character variable. Unlike
glimpse which gives you a rough approximation of how much
the data may vary, summary quantifies the variation of each
numeric variable in a bit more detail.
summary(MI_final_small)
## SEX AGE10 PRSMI20 PARTYID
## Min. :1.00 Min. : 1.000 Min. :0.00 Min. :1.000
## 1st Qu.:1.00 1st Qu.: 6.000 1st Qu.:1.00 1st Qu.:1.000
## Median :2.00 Median : 8.000 Median :1.00 Median :2.000
## Mean :1.53 Mean : 8.476 Mean :1.63 Mean :2.236
## 3rd Qu.:2.00 3rd Qu.: 9.000 3rd Qu.:2.00 3rd Qu.:3.000
## Max. :2.00 Max. :99.000 Max. :9.00 Max. :9.000
##
## WEIGHT QRACEAI EDUC18 LGBT
## Min. :0.1003 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:0.3775 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :0.8020 Median :1.000 Median :3.000 Median :2.000
## Mean :1.0000 Mean :1.572 Mean :3.288 Mean :2.224
## 3rd Qu.:1.4498 3rd Qu.:1.000 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :5.0853 Max. :9.000 Max. :9.000 Max. :9.000
## NA's :615
## BRNAGAIN LATINOS RACISM20
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :2.000 Median :2.000
## Mean :1.907 Mean :2.175 Mean :2.325
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :9.000 Max. :9.000 Max. :9.000
## NA's :615 NA's :615
## QLT20 preschoice Quality
## [DON'T READ] Don’t know/refused: 26 Length:1231 Length:1231
## Can unite the country :125 Class :character Class :character
## Cares about people like me :121 Mode :character Mode :character
## Has good judgment :205
## Is a strong leader :138
## NA's :616
##
This is useful because it can reveal potential issues that we need to
deal with. If we look at the maximum value for AGE10 for
example, we can see that it has a value of 99 even though
the Exit Poll questionnaire indicates that the largest value should only
be a 10 (for those selecting “75 and over”). This is
because respondents who skipped that question were coded as having a
value of 99. We also see that in other variables (e.g.,
PARTYID) there are values of 9 even though
there is no value associated with that in the questionnaire. These are
again missing data!
Recall that this is the actual data that was used and we see right
away that while some variables have explicit missing data (e.g.,
LGBT, BRNAGAIN, RACISM20,
QLT20) others have missing data that is not recognized as
missing by R because of how it was coded. Moreover, the code for missing
data varies between variables in this case! A 9 indicates
missing data in PARTYID but it means something real in
AGE10 (which uses 99 to denote missing
data).
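One common fix (sketched here on made-up vectors, since each variable has its own missing code) is to recode the variable-specific codes to R's NA before analysis:

```r
# Toy vectors whose names mirror the exit-poll variables; the values are invented.
AGE10   <- c(2, 10, 99, 7)   # 99 means "skipped" for this variable
PARTYID <- c(3, 9, 1, 2)     # 9 means "skipped" for this one

AGE10[AGE10 == 99]    <- NA  # recode the variable-specific code to NA
PARTYID[PARTYID == 9] <- NA

AGE10   # 2 10 NA 7
```

After recoding, functions like summary and max(…, na.rm = TRUE) treat the skipped responses as genuinely missing rather than as large numbers.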
There are several functions depending on the type of data. For data
that takes on discrete values – either numeric or character – it can be
helpful to see the distribution of values that occur within a variable.
To do so we are going to get a count of each unique value
associated with a variable in a tibble.
In the code that follows we are using the tidyverse pipe
command %>% which you should interpret as meaning “and
then”. Using piping (%>%), we can count the
variable preschoice within the tibble
MI_final. This is what we will do as we are going to pipe
multiple commands to accomplish our intended tasks. Note that the
default reporting is to arrange the rows from smallest to largest
according to the values of the variable being counted
(here preschoice, which is being reported in alphabetical
order).
MI_final_small %>%
count(preschoice)
## # A tibble: 6 × 2
## preschoice n
## <chr> <int>
## 1 Another candidate 25
## 2 Donald Trump, the Republican 459
## 3 Joe Biden, the Democrat 723
## 4 Refused 14
## 5 Undecided/Don’t know 4
## 6 Will/Did not vote for president 6
Note that here we are creating a table that is printed to the console window and then disappears.
Quick Exercise Can you create a tibble containing a table of vote choice?
# INSERT CODE HERE
If we try to count several variables what R will do is to present the
count within each ordered value. So the following code will first
count each value of preschoice and then count
how that breaks down according to the values given in the variable
SEX.
MI_final_small %>%
count(preschoice,SEX)
## # A tibble: 12 × 3
## preschoice SEX n
## <chr> <dbl> <int>
## 1 Another candidate 1 17
## 2 Another candidate 2 8
## 3 Donald Trump, the Republican 1 247
## 4 Donald Trump, the Republican 2 212
## 5 Joe Biden, the Democrat 1 304
## 6 Joe Biden, the Democrat 2 419
## 7 Refused 1 7
## 8 Refused 2 7
## 9 Undecided/Don’t know 1 3
## 10 Undecided/Don’t know 2 1
## 11 Will/Did not vote for president 1 1
## 12 Will/Did not vote for president 2 5
This is an important reminder to always inspect your data; you often, if not always, need to wrangle data before doing analysis!
Quick Exercise How many Democrats and Republicans are in our sample?
# INSERT CODE HERE
Note that count is not useful for many-valued
(continuous) variables because each value is likely to occur only a
small number of times.
MI_final %>%
count(WEIGHT)
## # A tibble: 411 × 2
## WEIGHT n
## <dbl> <int>
## 1 0.100 1
## 2 0.113 1
## 3 0.119 1
## 4 0.133 2
## 5 0.141 1
## 6 0.142 1
## 7 0.144 1
## 8 0.146 1
## 9 0.147 1
## 10 0.149 5
## # ℹ 401 more rows
Here the summary function is more useful for describing
the variation in our data. In addition to summarizing the entire dataset
we can also select a specific variable and use
summary.
MI_final %>%
select(WEIGHT) %>%
summary(WEIGHT)
## WEIGHT
## Min. :0.1003
## 1st Qu.:0.3775
## Median :0.8020
## Mean :1.0000
## 3rd Qu.:1.4498
## Max. :5.0853
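Another option for a many-valued variable (a suggestion that goes beyond the notes) is to bin it with base R's cut function and then count the bins, so each row of the table covers a range of values rather than a single value:

```r
# A toy weight vector; the values and break points are invented for illustration.
WEIGHT <- c(0.10, 0.35, 0.80, 1.45, 2.10, 5.09)
bins <- cut(WEIGHT, breaks = c(0, 0.5, 1, 2, 6))
table(bins)   # counts per bin rather than one count per unique value
```

Choosing the break points is itself a wrangling decision that should be documented in your code.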
Once we load a dataset into active memory and take a look to get a sense of what we have we often want to be able to focus on particular variables or observations. Not every variable or observation will be valuable for the analyses that we do and we want the ability to extract the relevant data either for future analyses (e.g., creating a new tibble with just the relevant data) or for the analysis that we want to do right now (e.g., perform a calculation using a specific variable in the tibble on a subset of the observations). If we are interested in how respondents who self-identify as “born again” reported voting in 2020 in Michigan, for example, we only need to analyze the presidential vote choice of born-again voters.
For the data in this class, columns are variables and rows are observations. That is, each row is a unique data point and each column is a description of one feature of that data point. When doing analysis we often want to focus on observations with particular features – e.g., voters from a particular state in a nationwide survey.
select
To begin, let’s create a new tibble called MI_small by
extracting/selecting 4 variables from MI_final_small. If we have
a large dataset it can be useful to create a smaller dataset to save
computer memory and speed processing time. This is not a concern for any
of the data we will use, but it is a useful illustration. Beyond
creating new tibbles that are a subset of existing tibbles, the other
use of select is to extract particular variables for
analysis within the tibble.
In the code that follows we are using the tidyverse pipe
command %>% which you should interpret as meaning “and
then”. So, in English we would read this code as: “MI_Small is defined
to be MI_final and then select the variables SEX, AGE10, PRSMI20,
PARTYID and then glimpse the results”.
MI_small <- MI_final_small %>%
select(SEX,AGE10,PRSMI20,PARTYID) %>%
glimpse()
## Rows: 1,231
## Columns: 4
## $ SEX <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1,…
## $ AGE10 <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 6, 8, 9, 10, 1, 5, 9, 10, 8, 4, 1, 8,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1,…
What this code does is to create a new tibble called
MI_small that consists of the variables SEX,
AGE10, PRSMI20, and PARTYID from
the MI_final_small tibble and then prints a glimpse of the new
tibble to the screen so we can check to confirm that it did what we
wanted it to do. While MI_final_small had a dimension of 1231,
14, the new tibble MI_small has a dimension of 1231, 4.
We can also drop variables from a tibble by negatively selecting
them. To drop AGE10 we just “subtract” the variable from
the selection as follows:
MI_small %>%
select(-AGE10) %>%
glimpse()
## Rows: 1,231
## Columns: 3
## $ SEX <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1,…
Note that I am piping through the glimpse to confirm
that the code is doing what I think it is doing. A large number of
coding and analysis mistakes are caused by differences in what the data
scientists thinks the data is relative to what the data
actually is, so it is usually a good idea to confirm that you
have done what you think you have done.
Quick Exercise Create a new tibble called
MI_small_2 that contains the variables EDUC18,
PARTYID, and SEX.
# INSERT CODE HERE
We can also select variables based on features of the
variable names. To select all variables that start with the letter “P”,
for example, we would use:
MI_small %>%
select(starts_with("P")) %>%
glimpse()
## Rows: 1,231
## Columns: 2
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1,…
Why would we ever do this? If you are working with a large number of variables it can sometimes be useful to have uniform naming conventions (e.g., have all demographic variables start with the prefix “d_” as in “d_age”).
We can also select variables that end with (or do not end with) a
particular set of values. The code below selects all variables that
do not – hence the ! that tells R to do
the opposite of the function (i.e., “does not”) – end in a
0.
MI_small %>%
select(!ends_with("0")) %>%
glimpse()
## Rows: 1,231
## Columns: 2
## $ SEX <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1,…
These functions are admittedly a bit specific and we won’t really use
them much in class, but it is good to know the functionality exists.
Note that we are not limited to a single character, we could select
variables that starts_with("PAR") or
ends_with("20").
We can also select a range of variables using the sequence notation
: (all values between) that takes all variables between the
two variables – including the two variables themselves. For example, to select all
variables between SEX and PRSMI20 in the
tibble we could do the following.
MI_final_small %>%
select(SEX:PRSMI20) %>%
glimpse()
## Rows: 1,231
## Columns: 3
## $ SEX <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1,…
## $ AGE10 <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 6, 8, 9, 10, 1, 5, 9, 10, 8, 4, 1, 8,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1,…
Before reading the next sentence, can you think why this is not good coding practice? Answer: it makes the code depend on the arrangement of the variables in your data such that a different arrangement of variables will produce differences in output. We always want our code to be replicable and it is therefore desirable to use coding practices that are not going to be affected by reorderings that do not change the underlying data.
arrange
When working with the rows of a tibble it can sometimes be useful to
rearrange their order. The exit poll data we are working with, for
example, has no natural ordering – the order of the data is the order in
which it was collected. But we may want to rearrange the data to sort it
in increasing (or decreasing) order according to selected variables. To
do so we can use the arrange function to sort the tibble
according to the values of the specified variables. To rearrange the
MI_small tibble by SEX (from smallest value to
largest value) we would use:
MI_final_small %>%
arrange(SEX) %>%
glimpse()
## Rows: 1,231
## Columns: 14
## $ SEX <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ AGE10 <dbl> 9, 9, 8, 8, 9, 10, 5, 9, 8, 4, 1, 1, 6, 7, 7, 4, 99, 9, 10,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 0, 2,…
## $ PARTYID <dbl> 3, 1, 1, 1, 3, 2, 4, 1, 3, 3, 3, 1, 1, 4, 3, 4, 3, 1, 3, 2,…
## $ WEIGHT <dbl> 0.1991648, 1.3713049, 1.1540513, 1.3000787, 0.1469731, 0.84…
## $ QRACEAI <dbl> 1, 1, 1, 2, 9, 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 1, 5, 1, 1, 1,…
## $ EDUC18 <dbl> 4, 3, 3, 4, 5, 5, 1, 1, 5, 2, 4, 1, 2, 5, 2, 2, 9, 2, 5, 1,…
## $ LGBT <dbl> NA, 2, 2, NA, NA, NA, 2, NA, 2, NA, 2, 1, 2, NA, NA, NA, NA…
## $ BRNAGAIN <dbl> NA, 1, 2, NA, NA, NA, 2, NA, 1, NA, 2, 2, 2, NA, NA, NA, NA…
## $ LATINOS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ RACISM20 <dbl> NA, 2, 2, NA, NA, NA, 2, NA, 4, NA, 3, 1, 2, NA, NA, NA, NA…
## $ QLT20 <fct> Has good judgment, NA, NA, Cares about people like me, Is a…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality <chr> "Has good judgment", NA, NA, "Cares about people like me", …
Note now how all of the values for SEX being glimpsed
consist of the value 1. Also note that we have
not saved this rearrangement - we have just rearranged
the tibble. How would you save the rearrangement?
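One answer, sketched with a toy data frame so it is self-contained (base R's order stands in for arrange here, and the data is invented): assign the rearranged result to a new object with <-.

```r
# A made-up data frame standing in for the exit poll.
df <- data.frame(SEX = c(2, 1, 2, 1), AGE10 = c(5, 3, 1, 9))

# The assignment is what saves the rearrangement; without it the sorted
# version is printed and then discarded.
df_sorted <- df[order(df$SEX), ]
df_sorted
```

Assigning back to the same name (df <- …) overwrites the original; assigning to a new name keeps both versions in the Environment.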
The default is to sort from smallest to largest, but to sort from
largest to smallest we need to tell R to take the variable in descending
(desc) order as follows.
MI_small %>%
arrange(desc(SEX)) %>%
glimpse()
## Rows: 1,231
## Columns: 4
## $ SEX <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ AGE10 <dbl> 2, 10, 7, 8, 7, 6, 1, 10, 8, 9, 7, 8, 9, 1, 10, 6, 10, 5, 7, 9…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 9,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 2, 4, 1, 1, 3, 2, 1, 1, 1, 1, 1, 3, 3, 1, 3, 4,…
Quick Exercise Sort the new tibble MI_small_2
you created by education level and save the sorted data.
# INSERT CODE HERE
We can also sort using multiple variables. If we arrange using several variables R will process them sequentially – sort by the first variable, then sort again within each of the sorted variables according to the values of the second variable, and so on. So if we wanted to sort by age within gender in ascending order we would use:
MI_small %>%
arrange(SEX, AGE10) %>%
glimpse()
## Rows: 1,231
## Columns: 4
## $ SEX <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ AGE10 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ PRSMI20 <dbl> 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 9, 8, 1,…
## $ PARTYID <dbl> 3, 1, 3, 2, 2, 4, 4, 2, 2, 2, 3, 1, 3, 4, 4, 1, 3, 3, 3, 9, 4,…
So you can now see that the data is sorted by AGE10
within each value of SEX. There is no limit to the number
of sorts we can do, or to whether we sort by ascending or descending
order.
As a test of what we have done so far, can you predict what the following code will produce?
MI_small %>%
select(SEX,AGE10,PARTYID) %>%
arrange(SEX, AGE10, desc(PARTYID)) %>%
glimpse()
## Rows: 1,231
## Columns: 3
## $ SEX <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ AGE10 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ PARTYID <dbl> 4, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 9, 9, 4, 4, 3,…
To be honest, we don’t really do much with arrange other
than when we are inspecting the data visually or making the
output a bit more sensible.
Recall that the count function produces a tibble that is
sorted according to the values of the variable being counted. This is
not always what we want, as we may prefer to sort the tibble according
to the most frequently occurring value in the data. If we wanted to
produce a table of values sorted in descending order, so that the top
row was the most frequently occurring value, we could arrange
after piping through a count.
MI_final_small %>%
count(preschoice) %>%
arrange(desc(n))
## # A tibble: 6 × 2
## preschoice n
## <chr> <int>
## 1 Joe Biden, the Democrat 723
## 2 Donald Trump, the Republican 459
## 3 Another candidate 25
## 4 Refused 14
## 5 Will/Did not vote for president 6
## 6 Undecided/Don’t know 4
filter
So far all we have done is to rearrange the observations (rows) in a
tibble. If we want to extract particular observations then we need to
use the filter function.
To select all the female respondents (i.e., SEX takes on
the value of “2”), we could use filter as follows:
MI_final_small %>%
filter(SEX == 2) %>%
glimpse()
## Rows: 652
## Columns: 14
## $ SEX <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ AGE10 <dbl> 2, 10, 7, 8, 7, 6, 1, 10, 8, 9, 7, 8, 9, 1, 10, 6, 10, 5, 7…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 2, 4, 1, 1, 3, 2, 1, 1, 1, 1, 1, 3, 3, 1, 3,…
## $ WEIGHT <dbl> 0.4045421, 1.8052619, 0.8601966, 0.1772090, 0.4921975, 1.50…
## $ QRACEAI <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2,…
## $ EDUC18 <dbl> 4, 1, 5, 5, 3, 4, 4, 1, 2, 2, 4, 2, 3, 2, 4, 3, 2, 5, 4, 4,…
## $ LGBT <dbl> NA, 2, 2, NA, 2, NA, NA, 2, 2, NA, NA, 2, NA, 2, 2, NA, 2, …
## $ BRNAGAIN <dbl> NA, 1, 2, NA, 2, NA, NA, 2, 2, NA, NA, 2, NA, 2, 1, NA, 9, …
## $ LATINOS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ RACISM20 <dbl> NA, 2, 2, NA, 2, NA, NA, 9, 2, NA, NA, 3, NA, 1, 2, NA, 2, …
## $ QLT20 <fct> Has good judgment, NA, NA, Cares about people like me, NA, …
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality <chr> "Has good judgment", NA, NA, "Cares about people like me", …
When we glimpse the results we can see that every value for
SEX is a 2 as requested.
Note the syntax: we use == to denote “is equal to” and
we use the value 2 without quotes because SEX is a numeric variable.
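A minimal sketch of this syntax on a made-up tibble (these values are invented, not the exit-poll data):

```r
library(dplyr)

# A tiny made-up tibble standing in for the exit-poll data
toy <- tibble(SEX = c(1, 2, 2, 1), AGE10 = c(3, 1, 5, 2))

# == tests equality; filter() keeps the rows where the test is TRUE
toy %>%
  filter(SEX == 2)

# A single = inside filter() is an error, not an equality test:
# filter(toy, SEX = 2) fails, and R suggests using == instead
```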
We can also obviously combine filter with
select to extract a subset of variables and observations as
follows:
MI_final_small %>%
select(SEX,PARTYID,preschoice) %>%
filter(SEX == 2) %>%
glimpse()
## Rows: 652
## Columns: 3
## $ SEX <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 2, 4, 1, 1, 3, 2, 1, 1, 1, 1, 1, 3, 3, 1, 3,…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
If we wanted to select observations based on a character or
factor variable, the syntax would be somewhat different because of the
difference between numeric and character values. To select only respondents
who supported Joe Biden using the preschoice variable, for
example, we would use quotes to denote that the value we are filtering
on is a character:
MI_final_small %>%
filter(preschoice == "Joe Biden, the Democrat") %>%
glimpse()
## Rows: 723
## Columns: 14
## $ SEX <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2,…
## $ AGE10 <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 8, 1, 9, 10, 4, 1, 8, 1, 6, 9, 1, …
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 1, 4, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1,…
## $ WEIGHT <dbl> 0.4045421, 1.8052619, 0.8601966, 0.1991648, 0.1772090, 0.49…
## $ QRACEAI <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ EDUC18 <dbl> 4, 1, 5, 4, 5, 3, 3, 3, 4, 4, 1, 1, 2, 4, 2, 1, 2, 3, 2, 4,…
## $ LGBT <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, 2, NA, 2, 2, 1, 2, N…
## $ BRNAGAIN <dbl> NA, 1, 2, NA, NA, 2, 1, 2, NA, NA, NA, 2, NA, 2, 2, 2, 2, N…
## $ LATINOS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ RACISM20 <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, 9, NA, 3, 2, 1, 2, N…
## $ QLT20 <fct> Has good judgment, NA, NA, Has good judgment, Cares about p…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality <chr> "Has good judgment", NA, NA, "Has good judgment", "Cares ab…
Many times we want to select observations (and/or variables) depending on whether or not several conditions are satisfied. To do so we use the fact that:
& (AND) selects if all conditions are true.
| (OR) selects if any condition is true.
So if we are considering condition A and condition B, filtering based
on & will select observations for which both A and B
are satisfied (i.e., the intersection) whereas | will
select observations for which either A or B are satisfied (i.e., the
union). Note that there is no limit to the number of conditions we can
use and we can also combine them.
A | B means either condition A or B is satisfied.
A & B means that both conditions A and B are satisfied.
A | B | C means that either condition A OR condition B OR condition C is satisfied.
(A & B) | C means that either conditions A and B OR condition C is satisfied.
A & (B | C) means that condition A AND either condition B or C is satisfied.
** SELF-TEST: If A = self-identified Democrat, B = self-identified Republican, and C = self-identified Independent, how would you interpret each of the conditions above in terms of which respondents would be filtered?
Let’s see how to implement this in R. According to the values coded in the Michigan Exit Poll we see that:
SEX: 1 = Male, 2 = Female
AGE10: 1 = “under 24”, … , 9 = “64-75”, 10 = “75+”
To select the self-reported vote choices of females under the age of
24 we would want to select observations from individuals who are both
female and under the age of 24. To do so, we use the
following (note that we are selecting to focus on the most
relevant variables as practice – the code also works without this
step):
MI_final_small %>%
select(SEX,AGE10,preschoice) %>%
filter(SEX == 2 & AGE10 == 1) %>%
glimpse()
## Rows: 17
## Columns: 3
## $ SEX <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ AGE10 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
OK, but it is hard to know what to take away from this. How about pulling together content from above to produce the following?
MI_final_small %>%
filter(SEX == 2 & AGE10 == 1) %>%
count(preschoice) %>%
arrange(desc(n))
## # A tibble: 3 × 2
## preschoice n
## <chr> <int>
## 1 Joe Biden, the Democrat 15
## 2 Another candidate 1
## 3 Donald Trump, the Republican 1
Note that here we have dropped the select code because
the point of our code is to summarize the distribution of self-reported
voting behavior, not to produce a new tibble for subsequent analysis.
Because we are applying the count function to the
variable preschoice, the piping will automatically select
the relevant variable.
Notice the difference in the nature and dimensions of the resulting tibbles from these two code snippets. The first code snippet produced a tibble with 17 observations and 3 columns – it is a new data set we can analyze (if we had saved it!). But the second code snippet is a tibble that consists only of 3 observations and 2 columns because it is counting the number of times that each of the 3 unique values occur among the filtered set of respondents.
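If we did want to save the first snippet’s result for later analysis, we would assign it to a new object with <-. A sketch on a made-up stand-in tibble (the names mirror the exit-poll variables but the values are invented):

```r
library(dplyr)

# Made-up stand-in for MI_final_small (invented values)
MI_final_small <- tibble(
  SEX        = c(2, 2, 1, 2),
  AGE10      = c(1, 1, 3, 7),
  preschoice = c("Joe Biden, the Democrat", "Another candidate",
                 "Donald Trump, the Republican", "Joe Biden, the Democrat")
)

# Assign the filtered tibble to a new object so it persists
MI_young_women <- MI_final_small %>%
  select(SEX, AGE10, preschoice) %>%
  filter(SEX == 2 & AGE10 == 1)

# The saved object can now be counted, summarized, or filtered further
MI_young_women %>%
  count(preschoice)
```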
Quick Exercise Replicate the analysis using male respondents
(i.e., SEX == 1). What do you observe about the number and
distribution of vote choices?
# INSERT CODE HERE
In addition to selecting cases for which both conditions are
satisfied, we can also select cases for which either condition A or
condition B is satisfied. If we want to focus on voters who voted for
either Biden or Trump – and ignore those who self-reported voting for
some other candidate – we can filter the data
accordingly:
MI_final_small %>%
filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
glimpse()
## Rows: 1,182
## Columns: 14
## $ SEX <dbl> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2,…
## $ AGE10 <dbl> 2, 10, 7, 9, 8, 7, 9, 8, 6, 8, 9, 10, 1, 5, 9, 10, 8, 4, 1,…
## $ PRSMI20 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1,…
## $ PARTYID <dbl> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, 1,…
## $ WEIGHT <dbl> 0.4045421, 1.8052619, 0.8601966, 0.1991648, 0.1772090, 0.49…
## $ QRACEAI <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 9, 1, 1, 1, 1, 1, 3, 1, 1, 1,…
## $ EDUC18 <dbl> 4, 1, 5, 4, 5, 3, 3, 3, 4, 4, 5, 5, 4, 1, 1, 1, 5, 2, 4, 2,…
## $ LGBT <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, NA, NA, 2, NA, 2, 2,…
## $ BRNAGAIN <dbl> NA, 1, 2, NA, NA, 2, 1, 2, NA, NA, NA, NA, NA, 2, NA, 2, 1,…
## $ LATINOS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,…
## $ RACISM20 <dbl> NA, 2, 2, NA, NA, 2, 2, 2, NA, NA, NA, NA, NA, 2, NA, 9, 4,…
## $ QLT20 <fct> Has good judgment, NA, NA, Has good judgment, Cares about p…
## $ preschoice <chr> "Joe Biden, the Democrat", "Joe Biden, the Democrat", "Joe …
## $ Quality <chr> "Has good judgment", NA, NA, "Has good judgment", "Cares ab…
Note that this produces a new tibble that could be used for
subsequent analysis containing only respondents who report voting for
either Biden or Trump. Because the number of rows decreases from
1231 in MI_Final_small to 1182
after the filter we have just performed, we can determine that
49 respondents reported voting for a candidate other than
Biden or Trump.
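A more compact way to write the same two-candidate filter is the %in% operator, which tests whether a value belongs to a set. A sketch on made-up data:

```r
library(dplyr)

# Made-up stand-in values
toy <- tibble(preschoice = c("Joe Biden, the Democrat",
                             "Donald Trump, the Republican",
                             "Another candidate",
                             "Joe Biden, the Democrat"))

# %in% keeps rows whose value matches any element of the vector,
# which is equivalent to chaining == tests together with |
toy %>%
  filter(preschoice %in% c("Joe Biden, the Democrat",
                           "Donald Trump, the Republican"))
```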
To reiterate how piped functions can change the meaning of a tibble, consider how the size and content of the tibble produced by the code we just ran compare to the following:
MI_final_small %>%
filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
count(preschoice) %>%
arrange(desc(n))
## # A tibble: 2 × 2
## preschoice n
## <chr> <int>
## 1 Joe Biden, the Democrat 723
## 2 Donald Trump, the Republican 459
Again, once we pipe (%>%) the filtered tibble through the
count function, the tibble changes from being organized
by observations (1182 x 14) to being organized by the number of unique
values in the variable being counted (2). Moreover, the meaning of the
columns changes from being variables associated with each observation to
being the number of observations taking on each value.
Always know what your tibble looks like and what it contains!
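One habit that helps: check a tibble’s dimensions and names before and after a pipeline. A minimal sketch on a toy tibble:

```r
library(dplyr)

toy <- tibble(x = 1:5, y = letters[1:5])

nrow(toy)   # number of rows (observations): 5
ncol(toy)   # number of columns (variables): 2
dim(toy)    # both at once: 5 2
names(toy)  # the variable names: "x" "y"
```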
NOTE: We can also use conditionals when selecting
variables. For example:
& (AND)
MI_final %>%
select(SEX & starts_with("P")) %>%
glimpse()
## Rows: 1,231
## Columns: 0
Note that the & version returns zero columns because no variable can simultaneously be SEX and start with “P”.
| (OR)
MI_final %>%
select(SEX | starts_with("P")) %>%
glimpse()
## Rows: 1,231
## Columns: 4
## $ SEX <hvn_lbl_> 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, …
## $ PRSMI20 <hvn_lbl_> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, …
## $ PARTYID <hvn_lbl_> 3, 1, 1, 3, 3, 3, 1, 1, 2, 1, 3, 2, 4, 4, 1, 1, 3, 3, 3, …
## $ PHIL3 <hvn_lbl_> 2, 2, 1, 9, 1, 2, 9, 2, 3, 2, 3, 3, 1, 3, 9, 2, 2, 1, 3, …
Using piping (%>%), we count the
variable preschoice within the tibble
MI_final_small. This is the general pattern we will follow:
piping multiple commands together to accomplish our intended tasks.
MI_final_small %>%
count(preschoice)
## # A tibble: 6 × 2
## preschoice n
## <chr> <int>
## 1 Another candidate 25
## 2 Donald Trump, the Republican 459
## 3 Joe Biden, the Democrat 723
## 4 Refused 14
## 5 Undecided/Don’t know 4
## 6 Will/Did not vote for president 6
Note that here we are creating a table that is printed to the console window and then disappears.
If we wanted to save it for later use, we could assign it to a new object and then manipulate it.
MI_prescount <- MI_final_small %>%
count(preschoice)
Once saved, we can access and manipulate the new object. Note that calling
MI_prescount directly reveals that it contains two
variables – a variable of value labels with the same name as the
variable that was counted (here preschoice) and a variable
of the number of observations associated with each value (here
n).
MI_prescount
## # A tibble: 6 × 2
## preschoice n
## <chr> <int>
## 1 Another candidate 25
## 2 Donald Trump, the Republican 459
## 3 Joe Biden, the Democrat 723
## 4 Refused 14
## 5 Undecided/Don’t know 4
## 6 Will/Did not vote for president 6
You can then access this using all of the tools we have already
talked about. For example, to get the number of Biden supporters we can
select the relevant row using filter via
MI_prescount %>%
filter(preschoice == "Joe Biden, the Democrat")
## # A tibble: 1 × 2
## preschoice n
## <chr> <int>
## 1 Joe Biden, the Democrat 723
Quick Exercise Extract the number of respondents who chose either Biden or Trump.
# INSERT CODE HERE
Consider what happens if we instead do:
MI_final_small %>%
select(preschoice) %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 1231
What did we do?!? Now we have selected the variable
preschoice and then counted up the total number of observations.
Before, we were counting the variable preschoice, which
counts the number of observations taking on each value.
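A minimal side-by-side sketch of the two behaviors on a made-up tibble:

```r
library(dplyr)

toy <- tibble(choice = c("A", "B", "A", "A"))

# count(variable): one row per unique value, with a frequency column n
toy %>% count(choice)   # A gets n = 3, B gets n = 1

# count() with no variable: a single row with the total number of rows
toy %>% count()         # n = 4
```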
Several of our variables were only asked of half the sample (the 2020 Michigan exit poll used two different questionnaires to try to ask more questions), and it is important to account for that when thinking about the amount of data we have and what might be possible.
For example, let’s look at which candidate quality respondents said mattered in deciding their vote.
MI_final_small %>%
count(Quality)
## # A tibble: 6 × 2
## Quality n
## <chr> <int>
## 1 Can unite the country 125
## 2 Cares about people like me 121
## 3 Has good judgment 205
## 4 Is a strong leader 138
## 5 [DON'T READ] Don’t know/refused 26
## 6 <NA> 616
So here we see, based on the number of NA values, that nearly
half of the respondents do not have a valid reply. We can use the
drop_na function to remove missing data that R
recognizes (more on this next lecture!). Note that the responses
also contain an actual value for missing data, but we will deal with
that later.
If we want to drop all of the missing observations in
Quality we can use the following – note that
drop_na is essentially applying a filter to remove missing
data.
MI_final_small %>%
drop_na(Quality) %>%
count(Quality)
## # A tibble: 5 × 2
## Quality n
## <chr> <int>
## 1 Can unite the country 125
## 2 Cares about people like me 121
## 3 Has good judgment 205
## 4 Is a strong leader 138
## 5 [DON'T READ] Don’t know/refused 26
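To see that drop_na on a single variable is just a filter on missingness, here is a minimal sketch on a made-up tibble (the values below are invented, not from the exit poll):

```r
library(dplyr)
library(tidyr)

toy <- tibble(Quality = c("Is a strong leader", NA, "Has good judgment", NA))

# These two pipelines keep exactly the same rows:
toy %>% drop_na(Quality)
toy %>% filter(!is.na(Quality))
```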
We can include multiple variables in this list. Note that if we do not supply a list of variables it will default to the entire data set! Given that some questions are only asked to half of the sample, why does the following code produce a tibble with 0 rows?
MI_final_small %>%
drop_na() %>%
count(Quality)
## # A tibble: 0 × 2
## # ℹ 2 variables: Quality <chr>, n <int>
This highlights the importance of knowing what your data looks like
in terms of missing data, and also why it can be important to use
select to first keep only the variables of most interest. You
do not want to remove observations because of missingness in variables
that you do not care about! As we will see, some functions have a built-in
parameter to deal with missing data (e.g., mean) while
others do not, so always know your data!
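For example, mean() has a built-in na.rm parameter. A minimal sketch with made-up values:

```r
# mean() returns NA if any value is missing...
x <- c(2, 4, NA, 6)
mean(x)                # NA

# ...unless we ask R to remove the missing values first
mean(x, na.rm = TRUE)  # 4
```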
Univariate is pretty much what it sounds like: one variable. When undertaking univariate data analysis, we need first and foremost to figure out what type of variable it is that we’re working with. Once we do that, we can choose the appropriate use of the variable, either as an outcome or as a possible predictor.
We’ll be working with data from every NBA player who was active during the 2018-19 season.
Here’s the data:
require(tidyverse)
nba<-read_rds("https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/nba_players_2018.Rds")
This data contains the following variables:
| Name | Definition |
|---|---|
| namePlayer | Player name |
| idPlayer | Unique player id |
| slugSeason | Season start and end |
| numberPlayerSeason | Which season for this player |
| isRookie | Rookie season, true or false |
| slugTeam | Team short name |
| idTeam | Unique team id |
| gp | Games Played |
| gs | Games Started |
| fgm | Field goals made |
| fga | Field goals attempted |
| pctFG | Percent of field goals made |
| fg3m | 3 point field goals made |
| fg3a | 3 point field goals attempted |
| pctFG3 | Percent of 3 point field goals made |
| pctFT | Free Throw percentage |
| fg2m | 2 point field goals made |
| fg2a | 2 point field goals attempted |
| pctFG2 | Percent of 2 point field goals made |
| agePlayer | Player age |
| minutes | Minutes played |
| ftm | Free throws made |
| fta | Free throws attempted |
| oreb | Offensive rebounds |
| dreb | Defensive rebounds |
| treb | Total rebounds |
| ast | Assists |
| blk | Blocks |
| tov | Turnovers |
| pf | Personal fouls |
| pts | Total points |
| urlNBAAPI | Source url |
We’re interested in the following questions:
To answer these questions we need to look at the following variables:
We’re going to go through a pretty standard set of steps for each variable. First, examine some cases. Second, based on our examination, we’ll try either a plot or a table. Once we’ve seen the plot or the table, we’ll think a bit about ordering, and then choose an appropriate measure of central tendency, and maybe variation.
It’s really important to understand the types of variables you’re working with. Many times analysts are indifferent to this step particularly with larger datasets. This can lead to a great deal of confusion down the road. Below are the variable types we’ll be working with this semester and the definition of each.
A continuous variable can theoretically be subdivided into arbitrarily small increments and still be meaningful. You may have encountered further subdivision of continuous variables into “interval” or “ratio” data in other classes. We RARELY use these distinctions in practice. The distinction between a continuous and a categorical variable is hugely consequential, but the distinction between interval and ratio is not really all that important in practice.
The mean is the most widely used measure of central tendency for a continuous variable. If the distribution of the variable isn’t very symmetric or there are large outliers, then the median is a much better measure of central tendency.
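A minimal sketch of why outliers matter here (made-up values):

```r
# With symmetric data the mean and median agree
income <- c(40, 45, 50, 55, 60)
mean(income)            # 50
median(income)          # 50

# One large outlier drags the mean far more than the median
income_outlier <- c(40, 45, 50, 55, 5000)
mean(income_outlier)    # 1038
median(income_outlier)  # 50
```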
A categorical variable divides the sample up into a set of mutually exclusive and exhaustive categories. Mutually exclusive means that each case can be in only one category, and exhaustive means that the categories cover every possible option. Categorical is sort of the “top” level classification for variables of this type. Within the broad classification of categorical there are multiple types of other variables.
An ordered categorical variable has– you guessed it– some kind of sensible order that can be applied. For instance, the educational attainment of an individual: high school diploma, associates degree, bachelor’s degree, graduate degree– is an ordered categorical variable.
Ordered categorical variables should be arranged in the order of the variable, with proportions or percentages associated with each order. The mode, or the category with the highest proportion, is a reasonable measure of central tendency, but with fewer than ten categories the analyst should generally just show the proportion in each category.
An ordered binary variable has just two levels, but can be ordered. For instance, is a bird undertaking its first migration: yes or no? A “no” means that the bird has migrated more than once.
The mean of a binary variable is exactly the same thing as the proportion of the sample with that characteristic. So, the mean of a binary variable for “first migration” where 1=“yes” will give the proportion of birds migrating for the first time.
An ordered binary variable coded as 0 or 1 can be summarized using the mean which is the same thing as the proportion of the sample with that characteristic.
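A minimal sketch with made-up 0/1 migration values:

```r
# Made-up 0/1 values: 1 = first migration, 0 = not
first_migration <- c(1, 0, 0, 1, 1)

mean(first_migration)                           # 0.6
sum(first_migration) / length(first_migration)  # the same proportion by hand
```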
An unordered categorical variable has no sensible ordering that can be applied. Think about something like college major. There’s no “number” we might apply to philosophy that has any meaningful distance from a number we might apply to chemical engineering.
Unlike an ordered variable, an unordered categorical variable should be arranged in terms of the proportions falling into each of the categories. As with an ordered variable, it’s best just to show the proportions in each category for variables with fewer than ten levels. The mode is a reasonable single-variable summary of an unordered categorical variable.
This kind of variable has no particular order, but it is still binary. A “1” means that the case has that characteristic, and a “0” means the case does not. For instance, whether a tree is deciduous or not.
An unordered binary variable coded as 0 or 1 can also be summarized by the mean, which is the same thing as the proportion of the sample with that characteristic.
In R, categorical variables CAN be stored as text, numbers or even logicals. Don’t count on the data to help you out– you as the analyst need to figure this out.
We probably need to talk about factors.
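Since factors will come up repeatedly, here is a minimal sketch of how a factor stores an ordering that a plain character vector lacks (made-up education values):

```r
# A character vector has no inherent order beyond the alphabet...
educ <- c("BA", "HS", "Grad", "HS", "AA")
sort(educ)   # alphabetical: "AA" "BA" "Grad" "HS" "HS"

# ...but a factor can store a substantively meaningful ordering
educ_f <- factor(educ, levels = c("HS", "AA", "BA", "Grad"), ordered = TRUE)
table(educ_f)   # tabulated in the order HS, AA, BA, Grad
```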
I’m going to walk you through how an analyst might typically decide what type of variables they’re working with. It generally works like this:
The first thing we’re going to do with any dataset is just to take a quick look. We can call the data itself, but that will just show the first few cases and the first few variables. Far better is the glimpse command, which shows us all variables and the first few observations for all of the variables. Here’s a link to the codebook for this dataset:
The six variables we’re going to think about are field goals, free throw percentage, seasons played, rookie season, college attended, and conference played in.
glimpse(nba)
## Rows: 530
## Columns: 37
## $ namePlayer <chr> "LaMarcus Aldridge", "Quincy Acy", "Steven Adams", …
## $ idPlayer <dbl> 200746, 203112, 203500, 203518, 1628389, 1628959, 1…
## $ slugSeason <chr> "2018-19", "2018-19", "2018-19", "2018-19", "2018-1…
## $ numberPlayerSeason <dbl> 12, 6, 5, 2, 1, 0, 0, 0, 0, 0, 8, 5, 4, 3, 1, 1, 1,…
## $ isRookie <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE…
## $ slugTeam <chr> "SAS", "PHX", "OKC", "OKC", "MIA", "CHI", "UTA", "C…
## $ idTeam <dbl> 1610612759, 1610612756, 1610612760, 1610612760, 161…
## $ gp <dbl> 81, 10, 80, 31, 82, 10, 38, 19, 34, 7, 81, 72, 43, …
## $ gs <dbl> 81, 0, 80, 2, 28, 1, 2, 3, 1, 0, 81, 72, 40, 4, 80,…
## $ fgm <dbl> 684, 4, 481, 56, 280, 13, 67, 11, 38, 3, 257, 721, …
## $ fga <dbl> 1319, 18, 809, 157, 486, 39, 178, 36, 110, 10, 593,…
## $ pctFG <dbl> 0.519, 0.222, 0.595, 0.357, 0.576, 0.333, 0.376, 0.…
## $ fg3m <dbl> 10, 2, 0, 41, 3, 3, 32, 6, 25, 0, 96, 52, 9, 24, 6,…
## $ fg3a <dbl> 42, 15, 2, 127, 15, 12, 99, 23, 74, 4, 280, 203, 34…
## $ pctFG3 <dbl> 0.2380952, 0.1333333, 0.0000000, 0.3228346, 0.20000…
## $ pctFT <dbl> 0.847, 0.700, 0.500, 0.923, 0.735, 0.667, 0.750, 1.…
## $ fg2m <dbl> 674, 2, 481, 15, 277, 10, 35, 5, 13, 3, 161, 669, 1…
## $ fg2a <dbl> 1277, 3, 807, 30, 471, 27, 79, 13, 36, 6, 313, 1044…
## $ pctFG2 <dbl> 0.5277995, 0.6666667, 0.5960347, 0.5000000, 0.58811…
## $ agePlayer <dbl> 33, 28, 25, 25, 21, 21, 23, 22, 23, 26, 28, 24, 25,…
## $ minutes <dbl> 2687, 123, 2669, 588, 1913, 120, 416, 194, 428, 22,…
## $ ftm <dbl> 349, 7, 146, 12, 166, 8, 45, 4, 7, 1, 150, 500, 37,…
## $ fta <dbl> 412, 10, 292, 13, 226, 12, 60, 4, 9, 2, 173, 686, 6…
## $ oreb <dbl> 251, 3, 391, 5, 165, 11, 3, 3, 11, 1, 112, 159, 48,…
## $ dreb <dbl> 493, 22, 369, 43, 432, 15, 20, 16, 49, 3, 498, 739,…
## $ treb <dbl> 744, 25, 760, 48, 597, 26, 23, 19, 60, 4, 610, 898,…
## $ ast <dbl> 194, 8, 124, 20, 184, 13, 25, 5, 65, 6, 104, 424, 1…
## $ stl <dbl> 43, 1, 117, 17, 71, 1, 6, 1, 14, 2, 68, 92, 54, 22,…
## $ blk <dbl> 107, 4, 76, 6, 65, 0, 6, 4, 5, 0, 33, 110, 37, 13, …
## $ tov <dbl> 144, 4, 135, 14, 121, 8, 33, 6, 28, 2, 72, 268, 58,…
## $ pf <dbl> 179, 24, 204, 53, 203, 7, 47, 13, 45, 4, 143, 232, …
## $ pts <dbl> 1727, 17, 1108, 165, 729, 37, 211, 32, 108, 7, 760,…
## $ urlNBAAPI <chr> "https://stats.nba.com/stats/playercareerstats?Leag…
## $ n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ org <fct> Texas, NA, Other, FC Barcelona Basquet, Kentucky, N…
## $ country <chr> NA, NA, NA, "Spain", NA, NA, NA, NA, NA, NA, NA, "S…
## $ idConference <int> 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, …
Let’s start by taking a look at field goals. It seems pretty likely that this is a continuous variable. Let’s take a look at the top 50 spots.
nba%>% ## Start with the dataset
select(namePlayer,slugTeam,fgm)%>% ## and then select a few variables
arrange(-fgm)%>% ## arrange in reverse order of field goals
print(n=50) ## print out the top 50
## # A tibble: 530 × 3
## namePlayer slugTeam fgm
## <chr> <chr> <dbl>
## 1 James Harden HOU 843
## 2 Bradley Beal WAS 764
## 3 Kemba Walker CHA 731
## 4 Giannis Antetokounmpo MIL 721
## 5 Kevin Durant GSW 721
## 6 Paul George OKC 707
## 7 Nikola Vucevic ORL 701
## 8 LaMarcus Aldridge SAS 684
## 9 Damian Lillard POR 681
## 10 Karl-Anthony Towns MIN 681
## 11 Donovan Mitchell UTA 661
## 12 D'Angelo Russell BKN 659
## 13 Klay Thompson GSW 655
## 14 Stephen Curry GSW 632
## 15 DeMar DeRozan SAS 631
## 16 Russell Westbrook OKC 630
## 17 Buddy Hield SAC 623
## 18 Blake Griffin DET 619
## 19 Nikola Jokic DEN 616
## 20 Tobias Harris MIN 611
## 21 Kyrie Irving BOS 604
## 22 Devin Booker PHX 586
## 23 Joel Embiid PHI 580
## 24 CJ McCollum POR 571
## 25 Julius Randle NOP 571
## 26 Andre Drummond DET 561
## 27 Kawhi Leonard TOR 560
## 28 LeBron James LAL 558
## 29 Jrue Holiday NOP 547
## 30 Montrezl Harrell LAC 546
## 31 Ben Simmons PHI 540
## 32 Anthony Davis NOP 530
## 33 Zach LaVine CHI 530
## 34 Jordan Clarkson CLE 529
## 35 Trae Young ATL 525
## 36 Bojan Bogdanovic IND 522
## 37 Pascal Siakam TOR 519
## 38 Collin Sexton CLE 519
## 39 Jamal Murray DEN 513
## 40 Deandre Ayton PHX 509
## 41 Luka Doncic DAL 506
## 42 Khris Middleton MIL 506
## 43 De'Aaron Fox SAC 505
## 44 Andrew Wiggins MIN 498
## 45 Kyle Kuzma LAL 496
## 46 Mike Conley MEM 490
## 47 Lou Williams LAC 484
## 48 Steven Adams OKC 481
## 49 Rudy Gobert UTA 476
## 50 Clint Capela HOU 474
## # ℹ 480 more rows
So what I’m seeing here is that field goals aren’t “clumped” at certain levels. Let’s confirm that by looking at a kernel density plot.
nba%>%
ggplot(aes(x=fgm))+
geom_density()
We can also use a histogram to figure out much the same thing.
nba%>%
ggplot(aes(x=fgm))+
geom_histogram()
Now, technically field goals don’t meet the definition I set out above as being a continuous variable because they aren’t divisible below a certain amount. Usually in practice though we just ignore this– this variable is “as good as” continuous, given that it varies smoothly over the range and isn’t confined to a relatively small set of possible values.
Quick Exercise: Do the same thing for field goal percentage and think about what kind of variable it is.
# INSERT CODE HERE
The mean is used most of the time for continuous variables, but it’s VERY sensitive to outliers. The median (50th percentile) is usually better, but it can be difficult to explain to general audiences.
nba%>%
summarize(mean_fgm=mean(fgm))
## # A tibble: 1 × 1
## mean_fgm
## <dbl>
## 1 191.
nba%>%
summarize(median_fgm=median(fgm))
## # A tibble: 1 × 1
## median_fgm
## <dbl>
## 1 157
In this case I’d really prefer the mean as a single measure of field goal production, but depending on the audience I still might just go ahead and use the median.
Quick Exercise What measure would you prefer for field goal percentage? Calculate that measure.
# INSERT CODE HERE
Let’s take a look at player seasons.
nba%>%
select(namePlayer,numberPlayerSeason)%>%
arrange(-numberPlayerSeason)%>%
print(n=50)
## # A tibble: 530 × 2
## namePlayer numberPlayerSeason
## <chr> <dbl>
## 1 Vince Carter 20
## 2 Dirk Nowitzki 20
## 3 Jamal Crawford 18
## 4 Tony Parker 17
## 5 Tyson Chandler 17
## 6 Pau Gasol 17
## 7 Nene 16
## 8 Carmelo Anthony 15
## 9 Udonis Haslem 15
## 10 LeBron James 15
## 11 Zaza Pachulia 15
## 12 Dwyane Wade 15
## 13 Kyle Korver 15
## 14 Luol Deng 14
## 15 Devin Harris 14
## 16 Dwight Howard 14
## 17 Andre Iguodala 14
## 18 JR Smith 14
## 19 Trevor Ariza 14
## 20 Andrew Bogut 13
## 21 Jose Calderon 13
## 22 Raymond Felton 13
## 23 Amir Johnson 13
## 24 Shaun Livingston 13
## 25 Chris Paul 13
## 26 Marvin Williams 13
## 27 Lou Williams 13
## 28 CJ Miles 13
## 29 LaMarcus Aldridge 12
## 30 J.J. Barea 12
## 31 Channing Frye 12
## 32 Rudy Gay 12
## 33 Kyle Lowry 12
## 34 Paul Millsap 12
## 35 JJ Redick 12
## 36 Rajon Rondo 12
## 37 Thabo Sefolosha 12
## 38 Marco Belinelli 11
## 39 Mike Conley 11
## 40 Kevin Durant 11
## 41 Jared Dudley 11
## 42 Marcin Gortat 11
## 43 Gerald Green 11
## 44 Al Horford 11
## 45 Joakim Noah 11
## 46 Thaddeus Young 11
## 47 Nick Young 11
## 48 Corey Brewer 11
## 49 D.J. Augustin 10
## 50 Jerryd Bayless 10
## # ℹ 480 more rows
Looks like it might be continuous? Let’s plot it:
nba%>%
ggplot(aes(x=numberPlayerSeason))+
geom_histogram(binwidth = 1)
Nope. See how it falls into a small set of possible categories? This is an ordered categorical variable. That means we should calculate the proportions in each category:
nba%>%
group_by(numberPlayerSeason)%>%
count(name="total_in_group")%>%
ungroup()%>%
mutate(proportion=total_in_group/sum(total_in_group))
## # A tibble: 20 × 3
## numberPlayerSeason total_in_group proportion
## <dbl> <int> <dbl>
## 1 0 105 0.198
## 2 1 89 0.168
## 3 2 56 0.106
## 4 3 43 0.0811
## 5 4 37 0.0698
## 6 5 33 0.0623
## 7 6 31 0.0585
## 8 7 25 0.0472
## 9 8 19 0.0358
## 10 9 20 0.0377
## 11 10 24 0.0453
## 12 11 11 0.0208
## 13 12 9 0.0170
## 14 13 9 0.0170
## 15 14 6 0.0113
## 16 15 6 0.0113
## 17 16 1 0.00189
## 18 17 3 0.00566
## 19 18 1 0.00189
## 20 20 2 0.00377
What does this tell us?
Quick Exercise Create a histogram for player age. What does that tell us about the NBA?
# INSERT CODE HERE
Let’s take a look at the variable for Rookie season.
nba%>%select(namePlayer,isRookie)
## # A tibble: 530 × 2
## namePlayer isRookie
## <chr> <lgl>
## 1 LaMarcus Aldridge FALSE
## 2 Quincy Acy FALSE
## 3 Steven Adams FALSE
## 4 Alex Abrines FALSE
## 5 Bam Adebayo FALSE
## 6 Rawle Alkins TRUE
## 7 Grayson Allen TRUE
## 8 Deng Adel TRUE
## 9 Jaylen Adams TRUE
## 10 DeVaughn Akoon-Purcell TRUE
## # ℹ 520 more rows
Okay, so that’s set to a logical. In R, TRUE and FALSE are special values that indicate the result of a logical question – in this case, whether or not the player is a rookie.
Usually we want a binary variable to have at least one version that’s structured so that 1 = TRUE and 0 = FALSE. This makes data analysis much easier. Let’s do that with this variable.
This code uses ifelse to create a new variable called
isRookie_bin that’s set to 1 if the isRookie
variable is true, and 0 otherwise.
nba<-nba%>%
mutate(isRookie_bin=ifelse(isRookie==TRUE,1,0))
Now that it’s coded 0,1 we can calculate the mean, which is the same thing as the proportion of the players that are rookies.
nba%>%summarize(mean=mean(isRookie_bin))
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.198
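A side note on this design choice: because R coerces logicals to numeric (TRUE = 1, FALSE = 0) automatically, the ifelse step is not strictly required. This sketch shows the shortcut on made-up values:

```r
# R coerces logicals to numeric (TRUE = 1, FALSE = 0) on the fly,
# so the mean of a logical is already the proportion of TRUEs
isRookie <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

mean(isRookie)                # 0.4
mean(ifelse(isRookie, 1, 0))  # the same answer via explicit recoding
```

The explicit 0/1 recoding in the lecture code is still a good habit, since it makes the coding of the variable visible to anyone reading the analysis.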
Let’s take a look at which college a player attended, which is a good
example of an unordered categorical variable. The org
variable tells us which organization the player was in before playing in
the NBA.
nba%>%
select(org)%>%
glimpse()
## Rows: 530
## Columns: 1
## $ org <fct> Texas, NA, Other, FC Barcelona Basquet, Kentucky, NA, Duke, NA, NA…
These look like team or college names, so this would be an unordered categorical variable. Let’s take a look at the counts of players from different organizations:
nba%>%
group_by(org)%>%
count()%>%
arrange(-n)%>%
print(n=50)
## # A tibble: 68 × 2
## # Groups: org [68]
## org n
## <fct> <int>
## 1 <NA> 157
## 2 Other 85
## 3 Kentucky 25
## 4 Duke 17
## 5 California-Los Angeles 15
## 6 Kansas 11
## 7 Arizona 10
## 8 Texas 10
## 9 North Carolina 9
## 10 Michigan 8
## 11 Villanova 7
## 12 Indiana 6
## 13 Southern California 6
## 14 Syracuse 6
## 15 California 5
## 16 Louisville 5
## 17 Ohio State 5
## 18 Wake Forest 5
## 19 Colorado 4
## 20 Connecticut 4
## 21 Creighton 4
## 22 FC Barcelona Basquet 4
## 23 Florida 4
## 24 Georgia Tech 4
## 25 Michigan State 4
## 26 Oregon 4
## 27 Utah 4
## 28 Washington 4
## 29 Wisconsin 4
## 30 Boston College 3
## 31 Florida State 3
## 32 Georgetown 3
## 33 Gonzaga 3
## 34 Iowa State 3
## 35 Marquette 3
## 36 Maryland 3
## 37 Miami (FL) 3
## 38 North Carolina State 3
## 39 Notre Dame 3
## 40 Oklahoma 3
## 41 Purdue 3
## 42 Southern Methodist 3
## 43 Stanford 3
## 44 Tennessee 3
## 45 Virginia 3
## 46 Anadolu Efes S.K. 2
## 47 Baylor 2
## 48 Butler 2
## 49 Cincinnati 2
## 50 Kansas State 2
## # ℹ 18 more rows
Here we have a problem. If we’re interested just in colleges, we’re going to need to structure this a bit more. The code below filters out three categories that we don’t want: missing data, anything classified as “Other”, and sports teams from other countries. The last filter is incomplete; I probably missed some! If I were doing this for real, I would use a list of colleges and include only those names.
What I do below is negate str_detect by placing the ! in front of it. This means I want all of the cases that don’t match the supplied pattern. The pattern makes heavy use of the OR operator |. I’m saying I don’t want to include players whose organization includes the letters CB or KK and so on (these are common prefixes for sports organizations in other countries; I definitely did not look that up on Wikipedia. Ok, I did.).
nba%>%
filter(!is.na(org))%>%
filter(!org=="Other")%>%
filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
group_by(org)%>%
count()%>%
arrange(-n)%>%
print(n=50)
## # A tibble: 57 × 2
## # Groups: org [57]
## org n
## <fct> <int>
## 1 Kentucky 25
## 2 Duke 17
## 3 California-Los Angeles 15
## 4 Kansas 11
## 5 Arizona 10
## 6 Texas 10
## 7 North Carolina 9
## 8 Michigan 8
## 9 Villanova 7
## 10 Indiana 6
## 11 Southern California 6
## 12 Syracuse 6
## 13 California 5
## 14 Louisville 5
## 15 Ohio State 5
## 16 Wake Forest 5
## 17 Colorado 4
## 18 Connecticut 4
## 19 Creighton 4
## 20 Florida 4
## 21 Georgia Tech 4
## 22 Michigan State 4
## 23 Oregon 4
## 24 Utah 4
## 25 Washington 4
## 26 Wisconsin 4
## 27 Boston College 3
## 28 Florida State 3
## 29 Georgetown 3
## 30 Gonzaga 3
## 31 Iowa State 3
## 32 Marquette 3
## 33 Maryland 3
## 34 Miami (FL) 3
## 35 North Carolina State 3
## 36 Notre Dame 3
## 37 Oklahoma 3
## 38 Purdue 3
## 39 Southern Methodist 3
## 40 Stanford 3
## 41 Tennessee 3
## 42 Virginia 3
## 43 Baylor 2
## 44 Butler 2
## 45 Cincinnati 2
## 46 Kansas State 2
## 47 Louisiana State 2
## 48 Memphis 2
## 49 Missouri 2
## 50 Murray State 2
## # ℹ 7 more rows
That looks better. Which are the most common colleges and universities that send players to the NBA?
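To see how the pattern itself behaves, here’s a tiny sketch on a made-up vector (not the real org column), assuming stringr is loaded:

```r
library(stringr)

# | means OR, so str_detect() is TRUE if any alternative appears anywhere
# in the string; the leading ! keeps only the non-matches.
orgs <- c("Kentucky", "FC Barcelona Basquet", "Duke", "KK Partizan")
orgs[!str_detect(orgs, "CB|KK|FC")]
## [1] "Kentucky" "Duke"
```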
Quick Exercise Arrange the number of players by team in descending order.
# INSERT CODE HERE
There are two conferences in the NBA, the Eastern and the Western. Let’s take a look at the variable that indicates which conference the player played in that season.
nba%>%select(idConference)%>%
glimpse()
## Rows: 530
## Columns: 1
## $ idConference <int> 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, …
It looks like conference is structured as a numeric variable that takes the value “1” or “2”. Because it’s best to have binary variables structured as “has the characteristic” or “doesn’t have the characteristic,” we’re going to create a variable for the western conference that’s set to 1 if the player was playing in the western conference and 0 if not (which is the same as playing in the eastern conference).
nba<-nba%>%
mutate(west_conference=ifelse(idConference==1,1,0))
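Note that ifelse works element by element down the whole column. A quick base-R illustration with made-up values (not the real idConference column):

```r
# ifelse() is vectorized: it checks the condition for each element and
# returns 1 or 0 accordingly -- a toy version of the recode above.
id_conference <- c(2, 2, 1, 1, 2)  # made-up values, not the real column
ifelse(id_conference == 1, 1, 0)
## [1] 0 0 1 1 0
```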
Once we’ve done that, we can calculate the proportion of players who played in the western conference.
nba%>%
summarize(mean(west_conference))
## # A tibble: 1 × 1
## `mean(west_conference)`
## <dbl>
## 1 0.508
Makes sense!
Quick Exercise Create a variable for whether or not the player is from the USA. Calculate the proportion of players in the NBA who are from the USA. The coding on country is … decidedly US-centric, so you’ll need to think about this one a bit.
# INSERT CODE HERE
Ok, now that we know how this works, we can do some summary analysis. First of all, what does the total number of field goals made look like by college?
We know that field goals made are continuous (sort of), so we can summarize them within groups, and we know that college is a categorical variable, so we’ll use it to group the data. Grouped summaries like this, in particular the conditional mean, are something we’ll use a lot.
nba%>%
filter(!is.na(org))%>%
filter(!org=="Other")%>%
filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
group_by(org)%>%
summarize(mean_fg=sum(fgm))%>%
arrange(-mean_fg)%>%
print(n=50)
## # A tibble: 57 × 2
## org mean_fg
## <fct> <dbl>
## 1 Kentucky 6594
## 2 Duke 4623
## 3 Texas 3437
## 4 California-Los Angeles 3382
## 5 Kansas 2765
## 6 Arizona 2101
## 7 Oklahoma 1767
## 8 Southern California 1758
## 9 Louisville 1679
## 10 North Carolina 1659
## 11 Indiana 1522
## 12 Ohio State 1486
## 13 Michigan 1392
## 14 Wake Forest 1364
## 15 Connecticut 1299
## 16 Villanova 1222
## 17 Georgia Tech 1169
## 18 Tennessee 1095
## 19 Stanford 949
## 20 Utah 943
## 21 Marquette 873
## 22 Gonzaga 863
## 23 Michigan State 820
## 24 Colorado 818
## 25 Virginia 816
## 26 Maryland 811
## 27 Missouri 756
## 28 California 734
## 29 Florida State 733
## 30 Georgetown 717
## 31 Memphis 620
## 32 Florida 618
## 33 North Carolina State 598
## 34 Boston College 586
## 35 Louisiana State 583
## 36 Syracuse 567
## 37 Iowa State 523
## 38 Butler 459
## 39 Wisconsin 456
## 40 Creighton 432
## 41 Oregon 352
## 42 Texas A&M 322
## 43 Baylor 312
## 44 Providence 291
## 45 Purdue 275
## 46 Notre Dame 263
## 47 Ulkerspor 252
## 48 Southern Methodist 246
## 49 Oklahoma State 242
## 50 West Virginia 236
## # ℹ 7 more rows
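Before moving on, here’s a minimal sketch of what a grouped summary like the conditional mean is doing, using made-up data (assuming dplyr is loaded; toy, not the nba tibble):

```r
library(dplyr)

# A conditional mean is just a mean computed separately within each group.
toy <- tibble(college = c("A", "A", "B", "B", "B"),
              fgm     = c(10, 20, 30, 40, 50))
toy %>%
  group_by(college) %>%
  summarize(mean_fgm = mean(fgm))
## college A has mean 15; college B has mean 40
```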
Next, what about free throw percentage?
nba%>%
filter(!is.na(org))%>%
filter(!org=="Other")%>%
filter(!str_detect(org,"CB|KK|rytas|FC|B.C.|S.K.|Madrid"))%>%
group_by(org)%>%
summarize(mean_ftp=mean(pctFT))%>%
arrange(-mean_ftp)%>%
print(n=50)
## # A tibble: 57 × 2
## org mean_ftp
## <fct> <dbl>
## 1 Tennessee 0.842
## 2 Virginia 0.833
## 3 Oklahoma 0.823
## 4 North Carolina State 0.817
## 5 West Virginia 0.804
## 6 Ulkerspor 0.803
## 7 Missouri 0.802
## 8 Wake Forest 0.802
## 9 Florida State 0.801
## 10 Murray State 0.798
## 11 Iowa State 0.795
## 12 Notre Dame 0.792
## 13 Memphis 0.788
## 14 Florida 0.784
## 15 Michigan 0.783
## 16 Stanford 0.779
## 17 Georgetown 0.775
## 18 Marquette 0.774
## 19 Utah 0.770
## 20 Kansas State 0.767
## 21 Butler 0.762
## 22 Gonzaga 0.761
## 23 North Carolina 0.756
## 24 Villanova 0.755
## 25 Texas 0.752
## 26 Connecticut 0.748
## 27 Providence 0.747
## 28 Boston College 0.742
## 29 Michigan State 0.730
## 30 Kansas 0.729
## 31 Indiana 0.729
## 32 Duke 0.728
## 33 Baylor 0.726
## 34 Arizona 0.721
## 35 Pallacanestro Biella 0.718
## 36 Wisconsin 0.712
## 37 Kentucky 0.712
## 38 Georgia Tech 0.712
## 39 Louisiana State 0.709
## 40 Creighton 0.698
## 41 Maryland 0.695
## 42 Vanderbilt 0.688
## 43 Washington 0.680
## 44 Louisville 0.679
## 45 Ohio State 0.679
## 46 California 0.675
## 47 Southern Methodist 0.673
## 48 Oregon 0.662
## 49 Texas A&M 0.652
## 50 Southern California 0.648
## # ℹ 7 more rows
Quick Exercise Calculate field goals made by player season.
# INSERT CODE HERE
Quick Exercise Calculate free throw percent made by player season.
# INSERT CODE HERE
Note that in this instance we could also use arrange(-n) to sort in descending order. We use desc because it is slightly more generic (i.e., if we tried to use arrange(-SEX) above, it would not work).↩︎
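A small toy example of the difference the footnote describes (hypothetical data, assuming dplyr is loaded):

```r
library(dplyr)

# desc() works on any column type, including characters/factors,
# while the minus sign only works on numeric columns.
toy <- tibble(name = c("b", "a", "c"), n = c(2, 3, 1))
arrange(toy, desc(name))$name
## [1] "c" "b" "a"
arrange(toy, -n)$n   # fine here, because n is numeric
## [1] 3 2 1
```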